Industrial agriculture is threatened by a changing climate. With the current methods under threat and the environmental factors becoming increasingly difficult to mitigate, solutions are needed. One area of agriculture that is taking heavy damage is soil health. With pre-existing farming methods like tilling, fertilizer, and pesticide use, soil health is in steady decline. In this study, we identify the croplands of Bismarck, North Dakota, and determine how sustainable agriculture methods affect soil quality. Using indicators like soil quality index scores (SQIs), Beta-Glucosidase Activity (BGLU), and Bacteria Maker FAME (BACSUM), we utilized a machine learning workflow to determine how the practices affect soil health. Our results suggest that cover cropping, diversified rotations, and grazed continuous cropping support better soil health compared to fallow-based methods. Traditional rotation practices do not show an increase in soil recovery and restoration of organic matter. While these practices are not detrimental to existing agricultural land, the switch to more sustainable alternatives is imperative for adapting to the outcomes of climate change. Recognizing how soil health responds to different management practices can serve as a starting point for further research in different agrosystems outside of North Dakota.
Introduction/Hypothesis
As climate change worsens, attention to industrial agriculture increases. The specific methods of the industry, like tilling, monocropping, and fertilizer and pesticide use, pose a threat to soil health (Belete & Yadete (2023)). These practices, while they may be efficient and produce steady yields, have further environmental implications that are not sustainable in the long run (Worley (2024)). Soil health is critical to food and water security, as well as overall ecosystem function. In ecosystems, soil organic carbon (SOC) is fundamental in the processes of water permeation, nutrient retention, and aggregate stability. As Zhao et al. state, “Even a small change in the SOC can substantially affect not only the climate but also the stability of ecosystems, because of its decisive role in the exchange of carbon between the soil and atmosphere and plant growth/food production”(Zhao et al. (2021)). Studies have also supported that areas with less anthropogenic activity show low bulk density within the soil. (Cardoso et al. (2013)). Additionally, areas with less human activity tend to have a better accumulation of soil particles, therefore a better structure within the soil. This improved structure is vital for all ecosystems as it allows the permeability for water, air, and roots to be eased, helping them thrive.
There needs to be a shift to sustainable agriculture methods to restore and protect soil health. Sustaining soil health while simultaneously sustaining crop yields is the goal that the agriculture industry should be working towards. In addition, (Katherasala et al. (2024)) emphasize the importance of educating farmers on the harmful effect of current practices and alternative methods. Despite the need for sustainable practices, there is cause for concern and uncertainty towards transitioning to more sustainable agriculture, with most concerns being centered around changes in crop yields and production efficiency. However, (Miner et al. (2020)) provides research showing that under no-tilling conditions, yield declines can be partially mitigated when no-till conditions are combined with residue retention and crop rotation practices. These concerns can also be minimized with the use of soil quality indices (SQIs). SQIs are utilized to measure and monitor soil function and health through chemical and biological properties (Chaudhry et al. (2024)). With practices like crop rotation, residue retention, cover cropping, and livestock integration the SQIs can be determined to monitor soil health with these treatments and ensure it is beneficial for agricultural development. This study looked at different models to determine if crop rotation systems that integrate livestock and cover crops in the Northern Great Plains will result in significantly higher soil quality index scores in relation to microbial activity.
Methods
The data used for our research was collected in the USDA-ARS Northern Great Plains Research Lab from the years 2014-2016. We accessed this data through data.gov from the metadata set called “Conservation Practices Induce Tradeoffs in Soil Function: Observations from the Northern Great Plains” (Liebig et al. (2022)). We downloaded and utilized two of its dataset. The first, SES_On_Station_Intergrative Measures_SMAF, contained different soil scores across different land use types and treatment systems. While the second one, SES_On_Station_Soil_Properties_Near Surface, included measurements for various soil properties for each land use and treatment type. Both of which reflected near-surface (0-5 cm) depth data. After downloading the datasets, we created a new project in RStudio. Once created, we read in each csv file using the read_csv() function, saving each raw dataset to an object. The datasets were then inspected for any shared columns using the glimps() function. We were then able to join the two datasets by their overlapping columns “STUDY”, “LTEXP”, “TRT”, “LAND_USE”, “YEAR”, “REP”, “DEPTH”.
Through Exploratory Data Analysis, we were able to understand the merged data structure. This included identifying outliers, creating visualizations, and getting summary statistics. First we used glimps() to check data types, dimensions, and missing values. To further understand what values were missing. There were two columns that showed a high percentage of missing values, TOC and VESS, to address this we filtered the columns out and created a new dataset with the cleaned data. Then, to view some statistical summaries on our data we used the skim() function, allowing us to view the general trend in mean, standard deviation, and skewness for each variable. As a preprocessor step for correlation analysis, we created a new dataset with only numeric variables. This separation of numeric data allowed us to narrow down the variables that had a strong correlation to the soil quality index score (SMAFQI) variable and to each other. Both vis_cor() and cor() were used for this process. Due to their high correlation with each other, WSA, UREASE, SMAFQI, BGLU, TFAME, PHOS, and BACSUM were all selected to do a separate correlation analysis on. We then visualized each variable’s distribution using the gghistogram() function. Both Beta-Glucosidase Activity (BGLU) and Bacteria Maker FAME (BACSUM) were then selected to complete our remaining research on.
For categorical analysis, we used a box plot, density plot, and violin plots to visualize the relationship between each SMAFQI, BGLU, and BACSUM and treatment (LAND_USE) or management type (TRT). Next, to test for normality in SMAFQI, BGLU, and BACSUM, we used the Shapiro-Wilk normality test. The results informed us that our data did not follow a normal distribution. To help normalize the data, we log-transformed each variable. To verify that the log transformation worked, we used the Shapiro-Wilk normality test again, using the transformed data. Additionally, we viewed the transformation visually using gghistogram(). Both BGLU and BACSUM appeared normally distributed, while SMAFQI still remained skewed. Thus, Pearson’s correlation test was used to test BGLU and BACSUM individually, whereas the Spearman test was used to assess the correlation of the variables without log transforming them. Furthermore, to ensure that SMAFQI, BGLU, and BACSUM were correlated to the different management practices, the Kruskal-Wallis rank sum test was used. To visualize the log-log relationship between BGLU and BACSUM, and view how well both variables predict SMAFQI change, an XY plot of BGLU and BACSUM was made. The scale_color_viridis_c() function was used to color the points by SMAFQI.
We then began a machine learning workflow. First, we set a seed for reproducibility and split the data into an 80% training and 20% testing set. Additionally, a 10-fold cross-validation dataset was created. To preprocess the data recipe() was used to define a series of data preprocessing steps. Next, five different models were built, a workflow_set() was created using the 5 models, and each one was fit to the training data. Their performance was compared using autoplot() and rank_results(). The highest performing model, which was the nearest neighbor model, was then used to build a final workflow on the full training data. The augment() function was then used to make predictions on the test data. To evaluate the final model’s performance metrics() was used and an observed vs predicted scatter plot was made.
Results
Code
# Load and inspect datalibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
STUDY LTEXP TRT LAND_USE
Length:72 Length:72 Length:72 Length:72
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
YEAR REP DEPTH TOCs
Min. :2014 Min. :1.00 Length:72 Min. :0.3190
1st Qu.:2014 1st Qu.:1.75 Class :character 1st Qu.:0.5455
Median :2015 Median :2.50 Mode :character Median :0.6360
Mean :2015 Mean :2.50 Mean :0.6229
3rd Qu.:2016 3rd Qu.:3.25 3rd Qu.:0.7392
Max. :2016 Max. :4.00 Max. :0.8610
AGGs MBCs PMNs PHs
Min. :0.4160 Min. :0.8780 Min. :0.2070 Min. :0.5800
1st Qu.:0.8905 1st Qu.:0.9928 1st Qu.:0.6813 1st Qu.:0.8400
Median :1.0000 Median :0.9990 Median :0.9375 Median :0.9180
Mean :0.9092 Mean :0.9892 Mean :0.8122 Mean :0.8924
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.9925 3rd Qu.:0.9623
Max. :1.0060 Max. :1.0000 Max. :1.0000 Max. :1.0000
Ps QCO2s SBDs Ecs
Min. :0.9020 Min. :1 Min. :0.7250 Min. :0.8480
1st Qu.:0.9700 1st Qu.:1 1st Qu.:0.9750 1st Qu.:1.0000
Median :1.0000 Median :1 Median :0.9925 Median :1.0000
Mean :0.9848 Mean :1 Mean :0.9723 Mean :0.9971
3rd Qu.:1.0000 3rd Qu.:1 3rd Qu.:0.9940 3rd Qu.:1.0000
Max. :1.0000 Max. :1 Max. :0.9940 Max. :1.0000
AWCs BGs Ks WFPSs SMAFSQI
Min. :0.6490 Min. :0.0630 Min. :1 Min. :0.6110 Min. :75.35
1st Qu.:0.7488 1st Qu.:0.2122 1st Qu.:1 1st Qu.:0.9315 1st Qu.:84.09
Median :0.7755 Median :0.4705 Median :1 Median :0.9620 Median :88.86
Mean :0.7758 Mean :0.5118 Mean :1 Mean :0.9399 Mean :87.75
3rd Qu.:0.8110 3rd Qu.:0.8648 3rd Qu.:1 3rd Qu.:0.9740 3rd Qu.:93.15
Max. :0.8560 Max. :1.0080 Max. :1 Max. :0.9790 Max. :95.82
SBD POR SORP AWHC
Min. :0.980 Min. :0.4900 Min. :0.0440 Min. :0.1380
1st Qu.:1.090 1st Qu.:0.5400 1st Qu.:0.1087 1st Qu.:0.1660
Median :1.155 Median :0.5650 Median :0.1380 Median :0.1755
Mean :1.150 Mean :0.5658 Mean :0.1649 Mean :0.1765
3rd Qu.:1.210 3rd Qu.:0.5900 3rd Qu.:0.2135 3rd Qu.:0.1890
Max. :1.350 Max. :0.6300 Max. :0.5000 Max. :0.2100
WSA PH SLAN CEC
Min. :12.13 Min. :4.410 Min. : 65.5 Min. :13.20
1st Qu.:36.31 1st Qu.:5.340 1st Qu.:132.8 1st Qu.:18.48
Median :50.76 Median :5.700 Median :177.8 Median :19.90
Mean :48.28 Mean :5.668 Mean :184.0 Mean :19.77
3rd Qu.:61.44 3rd Qu.:5.982 3rd Qu.:224.4 3rd Qu.:20.93
Max. :79.00 Max. :6.700 Max. :394.5 Max. :26.10
UREASE PHOS BGLU TFAME
Min. : 11.78 Min. : 7.15 Min. : 98.62 Min. :192.6
1st Qu.: 44.77 1st Qu.: 59.17 1st Qu.:216.91 1st Qu.:314.9
Median : 85.78 Median :125.06 Median :317.92 Median :390.6
Mean : 90.27 Mean :149.94 Mean :353.35 Mean :421.9
3rd Qu.:115.15 3rd Qu.:201.10 3rd Qu.:481.12 3rd Qu.:530.8
Max. :216.25 Max. :585.56 Max. :878.97 Max. :959.5
BACSUM FUNSUM NO3N WFPS
Min. : 49.18 Min. : 29.36 Min. : 0.880 Min. :0.2800
1st Qu.: 83.86 1st Qu.: 57.05 1st Qu.: 6.513 1st Qu.:0.4700
Median :103.28 Median : 69.86 Median :11.660 Median :0.5300
Mean :115.69 Mean : 75.98 Mean :15.834 Mean :0.5253
3rd Qu.:145.84 3rd Qu.: 95.68 3rd Qu.:19.545 3rd Qu.:0.5825
Max. :281.81 Max. :150.45 Max. :80.540 Max. :0.7300
Code
library(skimr)skim(data_clean)
Data summary
Name
data_clean
Number of rows
72
Number of columns
37
_______________________
Column type frequency:
character
5
numeric
32
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
STUDY
0
1
23
23
0
1
0
LTEXP
0
1
19
31
0
3
0
TRT
0
1
3
7
0
2
0
LAND_USE
0
1
19
33
0
6
0
DEPTH
0
1
4
4
0
1
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
YEAR
0
1
2015.00
0.82
2014.00
2014.00
2015.00
2016.00
2016.00
▇▁▇▁▇
REP
0
1
2.50
1.13
1.00
1.75
2.50
3.25
4.00
▇▇▁▇▇
TOCs
0
1
0.62
0.16
0.32
0.55
0.64
0.74
0.86
▅▂▇▆▇
AGGs
0
1
0.91
0.16
0.42
0.89
1.00
1.00
1.01
▁▁▁▁▇
MBCs
0
1
0.99
0.02
0.88
0.99
1.00
1.00
1.00
▁▁▁▁▇
PMNs
0
1
0.81
0.24
0.21
0.68
0.94
0.99
1.00
▁▂▁▂▇
PHs
0
1
0.89
0.10
0.58
0.84
0.92
0.96
1.00
▁▂▁▃▇
Ps
0
1
0.98
0.02
0.90
0.97
1.00
1.00
1.00
▁▁▁▃▇
QCO2s
0
1
1.00
0.00
1.00
1.00
1.00
1.00
1.00
▁▁▇▁▁
SBDs
0
1
0.97
0.05
0.72
0.98
0.99
0.99
0.99
▁▁▁▁▇
Ecs
0
1
1.00
0.02
0.85
1.00
1.00
1.00
1.00
▁▁▁▁▇
AWCs
0
1
0.78
0.04
0.65
0.75
0.78
0.81
0.86
▁▂▇▇▅
BGs
0
1
0.51
0.34
0.06
0.21
0.47
0.86
1.01
▇▃▂▂▇
Ks
0
1
1.00
0.00
1.00
1.00
1.00
1.00
1.00
▁▁▇▁▁
WFPSs
0
1
0.94
0.06
0.61
0.93
0.96
0.97
0.98
▁▁▁▂▇
SMAFSQI
0
1
87.75
5.91
75.35
84.09
88.86
93.15
95.82
▃▂▅▅▇
SBD
0
1
1.15
0.09
0.98
1.09
1.15
1.21
1.35
▃▆▇▆▁
POR
0
1
0.57
0.03
0.49
0.54
0.56
0.59
0.63
▁▆▇▆▃
SORP
0
1
0.16
0.09
0.04
0.11
0.14
0.21
0.50
▇▆▂▁▁
AWHC
0
1
0.18
0.02
0.14
0.17
0.18
0.19
0.21
▁▅▇▅▃
WSA
0
1
48.28
17.69
12.13
36.31
50.76
61.44
79.00
▃▅▇▇▅
PH
0
1
5.67
0.48
4.41
5.34
5.70
5.98
6.70
▁▃▇▇▂
SLAN
0
1
183.99
62.63
65.50
132.78
177.80
224.44
394.50
▆▇▆▂▁
CEC
0
1
19.77
1.90
13.20
18.48
19.90
20.92
26.10
▁▂▇▃▁
UREASE
0
1
90.27
51.72
11.78
44.77
85.78
115.15
216.25
▇▇▆▃▂
PHOS
0
1
149.94
121.37
7.15
59.17
125.06
201.10
585.56
▇▆▂▁▁
BGLU
0
1
353.35
189.55
98.62
216.92
317.92
481.12
878.97
▇▅▆▁▁
TFAME
0
1
421.90
158.00
192.62
314.88
390.56
530.79
959.53
▇▆▃▁▁
BACSUM
0
1
115.69
43.41
49.18
83.86
103.28
145.84
281.81
▇▇▅▁▁
FUNSUM
0
1
75.98
27.51
29.36
57.05
69.85
95.68
150.45
▃▇▃▃▁
NO3N
0
1
15.83
15.89
0.88
6.51
11.66
19.55
80.54
▇▂▁▁▁
WFPS
0
1
0.53
0.09
0.28
0.47
0.53
0.58
0.73
▁▂▇▇▁
Code
# Separating numeric data into a new datasetnumeric_data <- data_clean |>select(where(is.numeric))vis_cor(numeric_data)
Warning in stats::cor(x, method = cor_method, use = na_action): the standard
deviation is zero
library(ggpubr)ggarrange(gghistogram(data_clean$WSA, title ="WSA", bins =20), gghistogram(data_clean$UREASE, title ="UREASE", bins =20), gghistogram(data_clean$SMAFSQI, title ="SMAFSQI", bins =20), gghistogram(data_clean$BGLU, title ="BGLU", bins =20), gghistogram(data_clean$TFAME, title ="TFAME", bins =20), gghistogram(data_clean$PHOS, title ="PHOS", bins =20), gghistogram(data_clean$BACSUM, title ="BACSUM", bins =20))
Code
# boxplot plot of soil quality index by long-term experiment ggplot(data_clean, aes(x = LTEXP, y = SMAFSQI)) +geom_boxplot(fill ="skyblue", color ="black") +theme_minimal() +labs(title ="Soil Quality Index by Management Practice",x ="Management Practice (LTEXP)",y ="SMAFSQI" ) +theme(axis.text.x =element_text(angle =45, hjust =1))
Code
# density plot of variables by management practice/treatment typeggplot(data_clean, aes(x = SMAFSQI, fill = LTEXP)) +geom_density(alpha =0.5) +theme_minimal() +labs(title ="Density of SMAFSQI by Management Practice",x ="Soil Quality Index (SMAFSQI)",fill ="Management Practice" )
Code
ggplot(data_clean, aes(x = BGLU, fill = LTEXP)) +geom_density(alpha =0.5) +theme_minimal() +labs(title ="Density of SMAFSQI by Management Practice",x ="BGLU",fill ="Management Practice" )
Code
ggplot(data_clean, aes(x = BACSUM, fill = LTEXP)) +geom_density(alpha =0.5) +theme_minimal() +labs(title ="Density of SMAFSQI by Management Practice",x ="BACSUM",fill ="Management Practice" )
Code
# violin plotsSMAFSQI_type_violin <-ggviolin(data_clean, x ="TRT", y ="SMAFSQI", add ="boxplot", color ="TRT")BGLU_type_violin <-ggviolin(data_clean, x ="TRT", y ="BGLU", add ="boxplot", color ="TRT")BACSUM_type_violin <-ggviolin(data_clean, x ="TRT", y ="BACSUM", add ="boxplot", color ="TRT")SMAFSQI_landuse_violin <-ggviolin(data_clean, x ="LAND_USE", y ="SMAFSQI", add ="boxplot", color ="LAND_USE")BGLU_landuse_violin <-ggviolin(data_clean, x ="LAND_USE", y ="BGLU", add ="boxplot", color ="LAND_USE")BACSUM_landuse_violin <-ggviolin(data_clean, x ="LAND_USE", y ="BACSUM", add ="boxplot", color ="LAND_USE")library(patchwork)combined_violins <- SMAFSQI_landuse_violin / BGLU_landuse_violin / BACSUM_landuse_violinggsave("images/combined_violins.png", plot = combined_violins, width =18, height =23)
Violin plots (Figure 1) further showed how BGLU, BACSUM, and SMAFSQI were spread out across LAND_USE. In near-surface soil, grazed and ungrazed continuous cropping showed high median values and narrower distributions for SMAFSQI. Likewise, for BGLU and BACSUM, grazed continuous cropping promoted higher β-glucosidase activity and bacterial activity. Hence, Figure 5 illustrates that continuous cropping, especially when grazed, consistently supports higher soil quality. In contrast, across all three indicators (BGLU, BACSUM, and SMAFSQI), spring wheat-fallow systems displayed the lowest medians with a wider spread. Signifying that fallow-based systems may reduce microbial activity and soil quality.
Shapiro-Wilk normality test
data: data_clean$SMAFSQI
W = 0.933, p-value = 0.0008572
Code
shapiro.test(data_clean$BGLU)
Shapiro-Wilk normality test
data: data_clean$BGLU
W = 0.9281, p-value = 0.0005017
Code
shapiro.test(data_clean$BACSUM)
Shapiro-Wilk normality test
data: data_clean$BACSUM
W = 0.9307, p-value = 0.0006649
Code
# normality test on log transformed variables shapiro.test(log(data_clean$SMAFSQI))
Shapiro-Wilk normality test
data: log(data_clean$SMAFSQI)
W = 0.92437, p-value = 0.0003373
Code
shapiro.test(log(data_clean$BGLU))
Shapiro-Wilk normality test
data: log(data_clean$BGLU)
W = 0.97296, p-value = 0.1231
Code
shapiro.test(log(data_clean$BACSUM))
Shapiro-Wilk normality test
data: log(data_clean$BACSUM)
W = 0.98818, p-value = 0.7397
Code
ggarrange(gghistogram(log(data_clean$SMAFSQI), title ="SMAFSQI", bins =20), gghistogram(log(data_clean$BGLU), title ="BGLU", bins =20), gghistogram(log(data_clean$BACSUM), title ="BACSUM", bins =20))
Code
# Correlation# Pearson’s correlation test, works on normally distributed datacor.test(log(data_clean$BGLU), log(data_clean$BACSUM))
Pearson's product-moment correlation
data: log(data_clean$BGLU) and log(data_clean$BACSUM)
t = 6.2409, df = 70, p-value = 2.928e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4251322 0.7286502
sample estimates:
cor
0.5979125
Code
# Spearman correlation test, works on non-parametric and non-linear data setscor.test(data_clean$BGLU, data_clean$BACSUM, method ="spearman")
Spearman's rank correlation rho
data: data_clean$BGLU and data_clean$BACSUM
S = 28654, p-value = 1.544e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5392951
Warning in cor.test.default(data_clean$BGLU, data_clean$SMAFSQI, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: data_clean$BGLU and data_clean$SMAFSQI
S = 7232.6, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8837135
Warning in cor.test.default(data_clean$SMAFSQI, data_clean$BACSUM, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: data_clean$SMAFSQI and data_clean$BACSUM
S = 25712, p-value = 6.137e-08
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5866019
Spearman’s rank correlation revealed moderately strong relationships between the variables BGLU, BACSUM, and SMAFSQI. Beta-Glucosidase Activity (BGLU) had a strong positive correlation with Soil Quality Index Score (SMAFSQI) (ρ = 0.88, p < 2.2e-16), suggesting that greater β-glucosidase activity was associated with improved soil quality. Bacteria Marker FAME (BACSUM) had a moderate positive correlation with SMAFSQI (ρ = 0.59, p < 6.14e-08), proposing that increasing bacterial FAME could lead to an increase in soil quality. Similarly, BGLU and BACSUM were moderately correlated with each other (ρ = 0.54, p < 1.54e-06). Each relation had a p-value of < 0.05, indicating that each correlation was statistically significant (Figure 2).
Figure 2
Code
# Kruskal-Wallis rank sum testkruskal.test(SMAFSQI ~ LAND_USE, data = data_clean)
Kruskal-Wallis rank sum test
data: SMAFSQI by LAND_USE
Kruskal-Wallis chi-squared = 56.642, df = 5, p-value = 5.993e-11
Code
kruskal.test(BGLU ~ LAND_USE, data = data_clean)
Kruskal-Wallis rank sum test
data: BGLU by LAND_USE
Kruskal-Wallis chi-squared = 43.918, df = 5, p-value = 2.407e-08
Code
kruskal.test(BACSUM ~ LAND_USE, data = data_clean)
Kruskal-Wallis rank sum test
data: BACSUM by LAND_USE
Kruskal-Wallis chi-squared = 35.721, df = 5, p-value = 1.08e-06
Code
# Kruskal-Wallis rank sum testkruskal.test(SMAFSQI ~ LTEXP, data = data_clean)
Kruskal-Wallis rank sum test
data: SMAFSQI by LTEXP
Kruskal-Wallis chi-squared = 52.656, df = 2, p-value = 3.68e-12
Code
kruskal.test(BGLU ~ LTEXP, data = data_clean)
Kruskal-Wallis rank sum test
data: BGLU by LTEXP
Kruskal-Wallis chi-squared = 41.85, df = 2, p-value = 8.175e-10
Code
kruskal.test(BACSUM ~ LTEXP, data = data_clean)
Kruskal-Wallis rank sum test
data: BACSUM by LTEXP
Kruskal-Wallis chi-squared = 25.8, df = 2, p-value = 2.498e-06
Kruskal-Wallis tests indicated that BGLU, BACSUM, and SMAFSQI significantly varied across different land use practices (LAND_USE) and management treatments (LTEXP). Across land use practices, SMAFSQI differed significantly for each practice (p = 5.99e-11), along with BGLU (p = 2.41e-08) and BACSUM (p = 1.08e-06) (Figure 3). Similarly, significant variability was found across management treatments for SMAFSQI (p = 3.68e-12), BGLU (p = 8.18e-10), and BACSUM (p = 2.50e-06) (Figure 3). The test results support the assumption that management strategies affect both soil properties and soil quality.
Figure 3
Code
log_visualization <-ggplot(data_clean, aes(x = BGLU, y = BACSUM)) +geom_point(aes(color = SMAFSQI)) +geom_smooth(method ="lm") +scale_color_viridis_c() +# Apply log transformations to the x and y axesscale_x_log10() +scale_y_log10() +theme_linedraw() +theme(legend.position ="bottom") +labs(title ="Beta-Glucosidase Activity vs Bacteria-marker FAME vs Soil Quality Index Score", x ="Beta-Glucosidase Activity", y ="Bacteria-marker FAME",color ="Soil Quality Index Score")ggsave("images/log_visualization.png", plot = log_visualization, width =8, height =7, units ="in")
`geom_smooth()` using formula = 'y ~ x'
The scatterplot of BGLU and BACSUM, whose points were colored by SMAFSQI, visually represented these correlations. The plot showed a gradient of soil quality scores. The higher soil quality scores generally clustered where both BGLU and BACSUM were high. Whereas the lower soil quality scores were gathered in areas of low BGLU and BACSUM (Figure 4). Signifying that BGLU and BACSUM could be used to predict the soil quality score.
Figure 4
Code
set.seed(123)data_split <-initial_split(data_clean, prop =0.8)train_data <-training(data_split)test_data <-testing(data_split)cv <-vfold_cv(train_data, v =10)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.98
2 rsq standard 0.931
3 mae standard 1.68
Code
# Create a plot of the observed vs predicted values with clear title, axis labels, and a compelling color scalefinal_model_fit <-ggplot(final_preds, aes(x = SMAFSQI, y = .pred, colour = SMAFSQI)) +scale_color_viridis_c() +geom_point() +geom_abline() +theme_linedraw() +labs(title ="Observed vs Predicted Soil Quality Index Score",x ="Observed",y ="Predicted")ggsave("images/final_model_fit.png", plot = final_model_fit, width =10, height =7, units ="in")
Among the five machine learning models evaluated, the nearest neighbor model showed the best performance. The final fitted model had an RMSE of 1.98 and an R² of 0.93. These results suggest a strong relationship between BGLU and BACSUM as predictors for SMAFSQI. The observed vs. predicted scatterplot (Figure 5) exhibited a clustering of data points along the 1:1 reference line, thus indicating a very strong model fit with minimal deviation.
Figure 5
Discussion/Conclusion
Our main findings support the hypothesis that crop-livestock integration and the use of cover crops result in significantly higher soil quality index scores in rainfed Northern Great Plains systems. These conservation practices stimulate microbial activity and enhance soil biological structure, offering a resilient path forward for sustainable dryland agriculture. Diversification and cover cropping support soil function and microbial activity. With these conservation practices, the increase of BGLU and BACSUM is observed, leading to the conclusion that the two variables can be used to predict the soil quality score. Additional findings from our research tell us that fallow-based methods are linked to a decrease in microbial activity. This is a method that is widely utilized in agricultural areas, as it restores organic matter, but this restoration ultimately falls below the performance of continuous cropping and diversification.
Other research from (Miner et al. (2020)) suggests that environmental benefits from crop diversity are regionally specific, with agroecosystems globally being very different. Due to this, it cannot be said for all agricultural systems that diversification and the use of cover crops will increase yield production and SQIs. However, for the rainfed croplands of Bismarck, North Dakota, this is the expected outcome. Our data and study suggest that biological pathways, like microbial stimulation from plant diversity, could support yield improvements. This microbial stimulation from plant diversity is supported by others, as Jalloh states, as it is associated with decomposition, carbon utilization, and nitrogen fixation, all key components of helping the stability of an ecosystem (Jalloh et al. (2024)).
The major limitation of this study was the small sample size of the data we had to work with. There were many difficulties at the beginning of our study that led us to change the scope of the study itself. Due to this small sample size, we were unable to see the desired results through our preliminary testing. Many tests fell short, specifically correlation tests, with there simply not being enough data in the originally used datasets. Because of this issue, we were forced to change our original hypothesis to align with newly found datasets. Another issue we encountered with the datasets was the time scale they offered. While one dataset we used had sufficient data for multiple years, other datasets with very similar data would have inputs from only one year. This lack of consistency proved to be more of a challenge for us than we anticipated.
For future studies, we recommend a larger sample size. As well as expanding the study outside of the small cropland in North Dakota. The smaller spatial scale was beneficial for our study, as we had limited time to collect data and work with it, but to determine which conservation practices would be beneficial for other types of biomes and ecosystems, more studies specific to different locations need to be done, as one method that works for one ecosystem may not work for another.
This study supports shifting from traditional practices to conservation practices like cover cropping and diversified rotations that maintain or improve SQIs for long-term sustainability. Although concerns for crop yield and production are widespread, yield efficiency is likely to increase as the soil quality and health improve. The SQIs can be utilized to determine what will help soil health and function in an agricultural system, which will, in turn, help the production and efficiency of crops.
References
Belete, T., & Yadete, E. (2023). Effect of mono cropping on soil health and fertility management for sustainable agriculture practices: A review. Plant Sci, 11, 192–197.
Cardoso, E. J. B. N., Vasconcellos, R. L. F., Bini, D., Miyauchi, M. Y. H., Santos, C. A. dos, Alves, P. R. L., Paula, A. M. de, Nakatani, A. S., Pereira, J. de M., & Nogueira, M. A. (2013). Soil health: Looking for suitable indicators. What should be considered to assess the effects of use and management on soil health? Scientia Agricola, 70, 274–289.
Chaudhry, H., Vasava, H. B., Chen, S., Saurette, D., Beri, A., Gillespie, A., & Biswas, A. (2024). Evaluating the soil quality index using three methods to assess soil fertility. Sensors, 24(3), 864.
Jalloh, A. A., Khamis, F. M., Yusuf, A. A., Subramanian, S., & Mutyambai, D. M. (2024). Long-term push–pull cropping system shifts soil and maize-root microbiome diversity paving way to resilient farming system. BMC Microbiology, 24(1), 92.
Katherasala, S., Bheenaveni, R. S., Chinthakindi, P., Vadlakonda, D., Kummari, N., & Bandi, S. S. (2024). Unveiling the detrimental impacts of intensive chemical use and monoculture on soil health and sustainable development goal 15: Life on land in telangana. Journal of Lifestyle and SDGs Review, 4(2), 10–47172.
Liebig, M. A., Acosta-Martinez, V., Archer, D. W., Halvorson, J. J., Hendrickson, J. R., Kronberg, S. L., Samson-Liebig, S. E., & Vetter, J. M. (2022). Conservation practices induce tradeoffs in soil function: Observations from the northern great plains. Soil Science Society of America Journal, 86(6), 1413–1430.
Miner, G. L., Delgado, J. A., Ippolito, J. A., & Stewart, C. E. (2020). Soil health management practices and crop productivity. Agricultural & Environmental Letters, 5(1), e20023.
Worley, N. (2024). The truth of industrial agriculture. The Chanterelle, 1(1), 9.
Zhao, F., Wu, Y., Hui, J., Sivakumar, B., Meng, X., & Liu, S. (2021). Projected soil organic carbon loss in response to climate warming and soil water content in a loess watershed. Carbon Balance and Management, 16, 1–14.
Source Code
---title: Final Projectauthors: - name: Avery Eastman affiliation: CSU roles: writing corresponding: true - name: Ava Zelenz affiliation: CSU roles: writing corresponding: falsebibliography: references.bibcsl: apa.cslformat: html: code-fold: true code-tools: trueexecute: echo: true---# **Sustainable Soil Management Practices and Impacts**## AbstractIndustrial agriculture is threatened by a changing climate. With the current methods under threat and the environmental factors becoming increasingly difficult to mitigate, solutions are needed. One area of agriculture that is taking heavy damage is soil health. With pre-existing farming methods like tilling, fertilizer, and pesticide use, soil health is in steady decline. In this study, we identify the croplands of Bismarck, North Dakota, and determine how sustainable agriculture methods affect soil quality. Using indicators like soil quality index scores (SQIs), Beta-Glucosidase Activity (BGLU), and Bacteria Maker FAME (BACSUM), we utilized a machine learning workflow to determine how the practices affect soil health. Our results suggest that cover cropping, diversified rotations, and grazed continuous cropping support better soil health compared to fallow-based methods. Traditional rotation practices do not show an increase in soil recovery and restoration of organic matter. While these practices are not detrimental to existing agricultural land, the switch to more sustainable alternatives is imperative for adapting to the outcomes of climate change. Recognizing how soil health responds to different management practices can serve as a starting point for further research in different agrosystems outside of North Dakota.## Introduction/HypothesisAs climate change worsens, attention to industrial agriculture increases. The specific methods of the industry, like tilling, monocropping, and fertilizer and pesticide use, pose a threat to soil health (@belete2023effect). These practices, while they may be efficient and produce steady yields, have further environmental implications that are not sustainable in the long run (@worley2024truth). Soil health is critical to food and water security, as well as overall ecosystem function. In ecosystems, soil organic carbon (SOC) is fundamental in the processes of water permeation, nutrient retention, and aggregate stability. As Zhao et al. state, “Even a small change in the SOC can substantially affect not only the climate but also the stability of ecosystems, because of its decisive role in the exchange of carbon between the soil and atmosphere and plant growth/food production”(@zhao2021projected). Studies have also supported that areas with less anthropogenic activity show low bulk density within the soil. (@cardoso2013soil). Additionally, areas with less human activity tend to have a better accumulation of soil particles, therefore a better structure within the soil. This improved structure is vital for all ecosystems as it allows the permeability for water, air, and roots to be eased, helping them thrive.There needs to be a shift to sustainable agriculture methods to restore and protect soil health. Sustaining soil health while simultaneously sustaining crop yields is the goal that the agriculture industry should be working towards. In addition, (@katherasala2024unveiling) emphasize the importance of educating farmers on the harmful effect of current practices and alternative methods. Despite the need for sustainable practices, there is cause for concern and uncertainty towards transitioning to more sustainable agriculture, with most concerns being centered around changes in crop yields and production efficiency. However, (@miner2020soil) provides research showing that under no-tilling conditions, yield declines can be partially mitigated when no-till conditions are combined with residue retention and crop rotation practices. These concerns can also be minimized with the use of soil quality indices (SQIs). SQIs are utilized to measure and monitor soil function and health through chemical and biological properties (@chaudhry2024evaluating). With practices like crop rotation, residue retention, cover cropping, and livestock integration the SQIs can be determined to monitor soil health with these treatments and ensure it is beneficial for agricultural development. This study looked at different models to determine if crop rotation systems that integrate livestock and cover crops in the Northern Great Plains will result in significantly higher soil quality index scores in relation to microbial activity.## MethodsThe data used for our research was collected in the USDA-ARS Northern Great Plains Research Lab from the years 2014-2016. We accessed this data through [data.gov](http://data.gov) from the metadata set called “Conservation Practices Induce Tradeoffs in Soil Function: Observations from the Northern Great Plains” (@liebig2022conservation). We downloaded and utilized two of its dataset. The first, SES_On_Station_Intergrative Measures_SMAF, contained different soil scores across different land use types and treatment systems. While the second one, SES_On_Station_Soil_Properties_Near Surface, included measurements for various soil properties for each land use and treatment type. Both of which reflected near-surface (0-5 cm) depth data. After downloading the datasets, we created a new project in RStudio. Once created, we read in each csv file using the read_csv() function, saving each raw dataset to an object. The datasets were then inspected for any shared columns using the glimps() function. We were then able to join the two datasets by their overlapping columns "STUDY", "LTEXP", "TRT", "LAND_USE", "YEAR", "REP", "DEPTH".Through Exploratory Data Analysis, we were able to understand the merged data structure. This included identifying outliers, creating visualizations, and getting summary statistics. First we used glimps() to check data types, dimensions, and missing values. To further understand what values were missing. There were two columns that showed a high percentage of missing values, TOC and VESS, to address this we filtered the columns out and created a new dataset with the cleaned data. Then, to view some statistical summaries on our data we used the skim() function, allowing us to view the general trend in mean, standard deviation, and skewness for each variable. As a preprocessor step for correlation analysis, we created a new dataset with only numeric variables. This separation of numeric data allowed us to narrow down the variables that had a strong correlation to the soil quality index score (SMAFQI) variable and to each other. Both vis_cor() and cor() were used for this process. Due to their high correlation with each other, WSA, UREASE, SMAFQI, BGLU, TFAME, PHOS, and BACSUM were all selected to do a separate correlation analysis on. We then visualized each variable’s distribution using the gghistogram() function. Both Beta-Glucosidase Activity (BGLU) and Bacteria Maker FAME (BACSUM) were then selected to complete our remaining research on.For categorical analysis, we used a box plot, density plot, and violin plots to visualize the relationship between each SMAFQI, BGLU, and BACSUM and treatment (LAND_USE) or management type (TRT). Next, to test for normality in SMAFQI, BGLU, and BACSUM, we used the Shapiro-Wilk normality test. The results informed us that our data did not follow a normal distribution. To help normalize the data, we log-transformed each variable. To verify that the log transformation worked, we used the Shapiro-Wilk normality test again, using the transformed data. Additionally, we viewed the transformation visually using gghistogram(). Both BGLU and BACSUM appeared normally distributed, while SMAFQI still remained skewed. Thus, Pearson's correlation test was used to test BGLU and BACSUM individually, whereas the Spearman test was used to assess the correlation of the variables without log transforming them. Furthermore, to ensure that SMAFQI, BGLU, and BACSUM were correlated to the different management practices, the Kruskal-Wallis rank sum test was used. To visualize the log-log relationship between BGLU and BACSUM, and view how well both variables predict SMAFQI change, an XY plot of BGLU and BACSUM was made. The scale_color_viridis_c() function was used to color the points by SMAFQI.We then began a machine learning workflow. First, we set a seed for reproducibility and split the data into an 80% training and 20% testing set. Additionally, a 10-fold cross-validation dataset was created. To preprocess the data recipe() was used to define a series of data preprocessing steps. Next, five different models were built, a workflow_set() was created using the 5 models, and each one was fit to the training data. Their performance was compared using autoplot() and rank_results(). The highest performing model, which was the nearest neighbor model, was then used to build a final workflow on the full training data. The augment() function was then used to make predictions on the test data. To evaluate the final model’s performance metrics() was used and an observed vs predicted scatter plot was made.## Results```{r, message=FALSE, warning=FALSE, results: hide}# Load and inspect datalibrary(tidyverse)library(tidymodels)integrative_measures_raw <- read_csv("/Users/Avery/github/Final-Project/data/SES_On-Station_Integrative_Measures_SMAF.csv")soil_properties_raw <- read_csv("/Users/Avery/github/Final-Project/data/SES_On-Station_Soil_Properties_Near_Surface.csv")glimpse(integrative_measures_raw)glimpse(soil_properties_raw)``````{r, message=FALSE, warning=FALSE, results: hide}# Merge datasetlibrary(dplyr)merged_data <- left_join(integrative_measures_raw, soil_properties_raw, by = c("STUDY", "LTEXP", "TRT", "LAND_USE", "YEAR", "REP", "DEPTH"))``````{r, message=FALSE, warning=FALSE, results: hide}# Understand Data Structure – Check types, dimensions, missing valuesglimpse(merged_data)``````{r, message=FALSE, warning=FALSE, results: hide}# Identify Data Issues – Outliers, missing data, inconsistencieslibrary(visdat)vis_dat(merged_data)vis_miss(merged_data)``````{r, message=FALSE, warning=FALSE, results: hide}# Clean datadata_clean <- merged_data |> select(-VESS, -TOC)``````{r, message=FALSE, warning=FALSE, results: hide}# Summary Statistics – Mean, median, variance, skewness, kurtosissummary(data_clean)library(skimr)skim(data_clean)``````{r, message=FALSE, warning=FALSE, results: hide}# Separating numeric data into a new datasetnumeric_data <- data_clean |> select(where(is.numeric))vis_cor(numeric_data)``````{r, message=FALSE, warning=FALSE, results: hide}data_clean |> select(WSA, UREASE, SMAFSQI, BGLU, TFAME, PHOS, BACSUM) |> cor()``````{r, message=FALSE, warning=FALSE, results: hide}data_clean |> select(SMAFSQI, BGLU, BACSUM) |> cor()``````{r, message=FALSE, warning=FALSE, results: hide}library(ggpubr)ggarrange(gghistogram(data_clean$WSA, title = "WSA", bins = 20), gghistogram(data_clean$UREASE, title = "UREASE", bins = 20), gghistogram(data_clean$SMAFSQI, title = "SMAFSQI", bins = 20), gghistogram(data_clean$BGLU, title = "BGLU", bins = 20), gghistogram(data_clean$TFAME, title = "TFAME", bins = 20), gghistogram(data_clean$PHOS, title = "PHOS", bins = 20), gghistogram(data_clean$BACSUM, title = "BACSUM", bins = 20))``````{r, message=FALSE, warning=FALSE, results: hide}# boxplot plot of soil quality index by long-term experiment ggplot(data_clean, aes(x = LTEXP, y = SMAFSQI)) + geom_boxplot(fill = "skyblue", color = "black") + theme_minimal() + labs( title = "Soil Quality Index by Management Practice", x = "Management Practice (LTEXP)", y = "SMAFSQI" ) + theme(axis.text.x = element_text(angle = 45, hjust = 1))# density plot of variables by management practice/treatment typeggplot(data_clean, aes(x = SMAFSQI, fill = LTEXP)) + geom_density(alpha = 0.5) + theme_minimal() + labs( title = "Density of SMAFSQI by Management Practice", x = "Soil Quality Index (SMAFSQI)", fill = "Management Practice" )ggplot(data_clean, aes(x = BGLU, fill = LTEXP)) + geom_density(alpha = 0.5) + theme_minimal() + labs( title = "Density of SMAFSQI by Management Practice", x = "BGLU", fill = "Management Practice" )ggplot(data_clean, aes(x = BACSUM, fill = LTEXP)) + geom_density(alpha = 0.5) + theme_minimal() + labs( title = "Density of SMAFSQI by Management Practice", x = "BACSUM", fill = "Management Practice" )# violin plotsSMAFSQI_type_violin <- ggviolin(data_clean, x = "TRT", y = "SMAFSQI", add = "boxplot", color = "TRT")BGLU_type_violin <- ggviolin(data_clean, x = "TRT", y = "BGLU", add = "boxplot", color = "TRT")BACSUM_type_violin <- ggviolin(data_clean, x = "TRT", y = "BACSUM", add = "boxplot", color = "TRT")SMAFSQI_landuse_violin <- ggviolin(data_clean, x = "LAND_USE", y = "SMAFSQI", add = "boxplot", color = "LAND_USE")BGLU_landuse_violin <- ggviolin(data_clean, x = "LAND_USE", y = "BGLU", add = "boxplot", color = "LAND_USE")BACSUM_landuse_violin <- ggviolin(data_clean, x = "LAND_USE", y = "BACSUM", add = "boxplot", color = "LAND_USE")library(patchwork)combined_violins <- SMAFSQI_landuse_violin / BGLU_landuse_violin / BACSUM_landuse_violinggsave("images/combined_violins.png", plot = combined_violins, width = 18, height = 23)```Violin plots (Figure 1) further showed how BGLU, BACSUM, and SMAFSQI were spread out across LAND_USE. In near-surface soil, grazed and ungrazed continuous cropping showed high median values and narrower distributions for SMAFSQI. Likewise, for BGLU and BACSUM, grazed continuous cropping promoted higher β-glucosidase activity and bacterial activity. Hence, Figure 5 illustrates that continuous cropping, especially when grazed, consistently supports higher soil quality. In contrast, across all three indicators (BGLU, BACSUM, and SMAFSQI), spring wheat-fallow systems displayed the lowest medians with a wider spread. Signifying that fallow-based systems may reduce microbial activity and soil quality.```{r, message=FALSE, warning=FALSE, results: hide}# Assessing Normalityshapiro.test(data_clean$SMAFSQI)shapiro.test(data_clean$BGLU)shapiro.test(data_clean$BACSUM)``````{r, message=FALSE, warning=FALSE, results: hide}# normality test on log transformed variables shapiro.test(log(data_clean$SMAFSQI))shapiro.test(log(data_clean$BGLU))shapiro.test(log(data_clean$BACSUM))ggarrange(gghistogram(log(data_clean$SMAFSQI), title = "SMAFSQI", bins = 20), gghistogram(log(data_clean$BGLU), title = "BGLU", bins = 20), gghistogram(log(data_clean$BACSUM), title = "BACSUM", bins = 20))``````{r, message=FALSE, warning=FALSE, results: hide}# Correlation# Pearson’s correlation test, works on normally distributed datacor.test(log(data_clean$BGLU), log(data_clean$BACSUM))# Spearman correlation test, works on non-parametric and non-linear data setscor.test(data_clean$BGLU, data_clean$BACSUM, method = "spearman")cor.test(data_clean$BGLU, data_clean$SMAFSQI, method = "spearman")cor.test(data_clean$SMAFSQI, data_clean$BACSUM, method = "spearman")```Spearman’s rank correlation revealed moderately strong relationships between the variables BGLU, BACSUM, and SMAFSQI. Beta-Glucosidase Activity (BGLU) had a strong positive correlation with Soil Quality Index Score (SMAFSQI) (ρ = 0.88, p \< 2.2e-16), suggesting that greater β-glucosidase activity was associated with improved soil quality. Bacteria Marker FAME (BACSUM) had a moderate positive correlation with SMAFSQI (ρ = 0.59, p \< 6.14e-08), proposing that increasing bacterial FAME could lead to an increase in soil quality. Similarly, BGLU and BACSUM were moderately correlated with each other (ρ = 0.54, p \< 1.54e-06). Each relation had a p-value of \< 0.05, indicating that each correlation was statistically significant (Figure 2).{width="284"}```{r, message=FALSE, warning=FALSE, results: hide}# Kruskal-Wallis rank sum testkruskal.test(SMAFSQI ~ LAND_USE, data = data_clean)kruskal.test(BGLU ~ LAND_USE, data = data_clean)kruskal.test(BACSUM ~ LAND_USE, data = data_clean)# Kruskal-Wallis rank sum testkruskal.test(SMAFSQI ~ LTEXP, data = data_clean)kruskal.test(BGLU ~ LTEXP, data = data_clean)kruskal.test(BACSUM ~ LTEXP, data = data_clean)```Kruskal-Wallis tests indicated that BGLU, BACSUM, and SMAFSQI significantly varied across different land use practices (LAND_USE) and management treatments (LTEXP). Across land use practices, SMAFSQI differed significantly for each practice (p = 5.99e-11), along with BGLU (p = 2.41e-08) and BACSUM (p = 1.08e-06) (Figure 3). Similarly, significant variability was found across management treatments for SMAFSQI (p = 3.68e-12), BGLU (p = 8.18e-10), and BACSUM (p = 2.50e-06) (Figure 3). The test results support the assumption that management strategies affect both soil properties and soil quality.{width="320"}```{r, message=FALSE, warning=FALSE, results: hide}log_visualization <- ggplot(data_clean, aes(x = BGLU, y = BACSUM)) + geom_point(aes(color = SMAFSQI)) + geom_smooth(method = "lm") + scale_color_viridis_c() + # Apply log transformations to the x and y axes scale_x_log10() + scale_y_log10() + theme_linedraw() + theme(legend.position = "bottom") + labs(title = "Beta-Glucosidase Activity vs Bacteria-marker FAME vs Soil Quality Index Score", x = "Beta-Glucosidase Activity", y = "Bacteria-marker FAME", color = "Soil Quality Index Score")ggsave("images/log_visualization.png", plot = log_visualization, width = 8, height = 7, units = "in")```The scatterplot of BGLU and BACSUM, whose points were colored by SMAFSQI, visually represented these correlations. The plot showed a gradient of soil quality scores. The higher soil quality scores generally clustered where both BGLU and BACSUM were high. Whereas the lower soil quality scores were gathered in areas of low BGLU and BACSUM (Figure 4). Signifying that BGLU and BACSUM could be used to predict the soil quality score.```{r, message=FALSE, warning=FALSE, results: hide}set.seed(123)data_split <- initial_split(data_clean, prop = 0.8)train_data <- training(data_split)test_data <- testing(data_split)cv <- vfold_cv(train_data, v = 10)``````{r, message=FALSE, warning=FALSE, results: hide}soil_recipe <- recipe(SMAFSQI ~ BGLU + BACSUM, data = train_data) |> step_log(all_predictors()) |> step_interact(terms = ~ BGLU:BACSUM) |> step_naomit(all_predictors(), all_outcomes())``````{r, message=FALSE, warning=FALSE, results: hide}lm_model <- linear_reg() |> set_engine('lm') |> set_mode("regression")rf_model <- rand_forest() |> set_engine("ranger", importance = "impurity") |> set_mode("regression")bt_model <- boost_tree() |> set_engine("xgboost") |> set_mode("regression")bag_model <- bag_mlp() |> set_engine("nnet") |> set_mode("regression")nn_model <- nearest_neighbor() |> set_engine("kknn") |> set_mode("regression")``````{r, message=FALSE, warning=FALSE, results: hide}library(tidyverse)library(tidymodels)library(powerjoin)library(glue)library(vip)library(baguette)wf_set <- workflow_set(list(soil_recipe), list(lm_model, rf_model, bt_model, bag_model, nn_model)) |> workflow_map('fit_resamples', resamples = cv)autoplot(wf_set)rank_results(wf_set, rank_metric = "rsq", select_best = TRUE)``````{r, message=FALSE, warning=FALSE, results: hide}final_fit <- workflow() |> add_recipe(soil_recipe) |> add_model(nn_model) |> fit(data = train_data)final_preds <- augment(final_fit, new_data = test_data)metrics(final_preds, truth = SMAFSQI, estimate = .pred)# Create a plot of the observed vs predicted values with clear title, axis labels, and a compelling color scalefinal_model_fit <- ggplot(final_preds, aes(x = SMAFSQI, y = .pred, colour = SMAFSQI)) + scale_color_viridis_c() + geom_point() + geom_abline() + theme_linedraw() + labs(title = "Observed vs Predicted Soil Quality Index Score", x = "Observed", y = "Predicted")ggsave("images/final_model_fit.png", plot = final_model_fit, width = 10, height = 7, units = "in")```Among the five machine learning models evaluated, the nearest neighbor model showed the best performance. The final fitted model had an RMSE of 1.98 and an R² of 0.93. These results suggest a strong relationship between BGLU and BACSUM as predictors for SMAFSQI. The observed vs. predicted scatterplot (Figure 5) exhibited a clustering of data points along the 1:1 reference line, thus indicating a very strong model fit with minimal deviation.## Discussion/ConclusionOur main findings support the hypothesis that crop-livestock integration and the use of cover crops result in significantly higher soil quality index scores in rainfed Northern Great Plains systems. These conservation practices stimulate microbial activity and enhance soil biological structure, offering a resilient path forward for sustainable dryland agriculture. Diversification and cover cropping support soil function and microbial activity. With these conservation practices, the increase of BGLU and BACSUM is observed, leading to the conclusion that the two variables can be used to predict the soil quality score. Additional findings from our research tell us that fallow-based methods are linked to a decrease in microbial activity. This is a method that is widely utilized in agricultural areas, as it restores organic matter, but this restoration ultimately falls below the performance of continuous cropping and diversification.Other research from (@miner2020soil) suggests that environmental benefits from crop diversity are regionally specific, with agroecosystems globally being very different. Due to this, it cannot be said for all agricultural systems that diversification and the use of cover crops will increase yield production and SQIs. However, for the rainfed croplands of Bismarck, North Dakota, this is the expected outcome. Our data and study suggest that biological pathways, like microbial stimulation from plant diversity, could support yield improvements. This microbial stimulation from plant diversity is supported by others, as Jalloh states, as it is associated with decomposition, carbon utilization, and nitrogen fixation, all key components of helping the stability of an ecosystem (@jalloh2024long).The major limitation of this study was the small sample size of the data we had to work with. There were many difficulties at the beginning of our study that led us to change the scope of the study itself. Due to this small sample size, we were unable to see the desired results through our preliminary testing. Many tests fell short, specifically correlation tests, with there simply not being enough data in the originally used datasets. Because of this issue, we were forced to change our original hypothesis to align with newly found datasets. Another issue we encountered with the datasets was the time scale they offered. While one dataset we used had sufficient data for multiple years, other datasets with very similar data would have inputs from only one year. This lack of consistency proved to be more of a challenge for us than we anticipated.For future studies, we recommend a larger sample size. As well as expanding the study outside of the small cropland in North Dakota. The smaller spatial scale was beneficial for our study, as we had limited time to collect data and work with it, but to determine which conservation practices would be beneficial for other types of biomes and ecosystems, more studies specific to different locations need to be done, as one method that works for one ecosystem may not work for another.This study supports shifting from traditional practices to conservation practices like cover cropping and diversified rotations that maintain or improve SQIs for long-term sustainability. Although concerns for crop yield and production are widespread, yield efficiency is likely to increase as the soil quality and health improve. The SQIs can be utilized to determine what will help soil health and function in an agricultural system, which will, in turn, help the production and efficiency of crops.